Starting from the pioneering works on software architecture, valuable guidelines have emerged to
indicate how computer programs should be organized. For example, the "separation of concerns"
principle suggests splitting a program into modules that overlap in functionality as little as
possible. However, these recommendations are mainly conceptual and are thus hard to express in a
quantitative form. Hence software architecture relies on the individual experience and skill of the
designers rather than on quantitative laws. In this article I apply the methods developed for the
classification of information on the World Wide Web to study the organization of Open Source
programs, in an attempt to establish the statistical laws governing software architecture.
The rapid increase in the size of software systems creates new challenges
for the design and maintenance of computer software. Modern systems are
constructed from many components forming a complex interdependent network.
Starting from pioneering works in the early 1970s, software architecture has
developed into a mature field that provides valuable guidelines for efficient
software development. However, these recommendations are mainly conceptual
and are thus difficult to express in a quantitative form. In this article I
construct the network formed by procedure calls in several open source
programs, with emphasis on the code of the Linux kernel [7]. The obtained
networks have scale-free properties similar to those of hyperlinks on the
World Wide Web and of other types of scale-free networks [12]. Thus
procedures can be ordered efficiently using the link analysis algorithms
developed for web pages. This makes it possible to find automatically the
important elements in the structure of a program and to propose a
quantitative criterion characterizing well organized software architectures.
Finally I analyze the spectral properties of the transition matrix between
the procedures and compare them with recent results for other networks [15].

In order to analyze quantitatively the network properties of computer code,
I study several open source programs written in the C programming language
[16]. In this widespread language the code is structured as a sequence of
procedures calling each other; the organization of a program can thus be
naturally represented as a procedure call network (PCN), where each node
represents a procedure and each directed edge corresponds to a procedure
call. This network is built by lexically scanning the source code of a
project and identifying all the defined procedures. For each of them, a list
keeps track of the procedure calls inside its definition. An example of the
obtained network for a toy code with two procedures is shown in Fig. 1.
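
The scanning step can be illustrated with a minimal Python sketch. It assumes tidy C code in which every definition starts at the beginning of a line; the regular expressions, the function name build_pcn and the toy source are illustrative assumptions, not the scanner used for the results reported here. Macros, function pointers, comments and string literals containing braces would require a real parser.

```python
import re

# Assumption: a definition looks like "<type> name(args) {" starting in column 0.
DEFINITION = re.compile(
    r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;{}]*\)\s*\{", re.M)
# Any identifier followed by "(" is a candidate call (this also hits "if (").
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def build_pcn(source):
    """Return {procedure: [called procedures]}, keeping repeated calls."""
    bodies = {}
    for match in DEFINITION.finditer(source):
        # Walk forward from the opening brace until the braces balance.
        depth, pos = 1, match.end()
        while depth and pos < len(source):
            depth += {"{": 1, "}": -1}.get(source[pos], 0)
            pos += 1
        bodies[match.group(1)] = source[match.end():pos]
    # Keep only calls whose target is itself a defined procedure,
    # which also discards keywords such as "if" or "while".
    return {name: [c for c in CALL.findall(body) if c in bodies]
            for name, body in bodies.items()}


# Hypothetical toy code with two procedures.
TOY = """
void work(int n) {
    if (n > 0)
        work(n - 1);
}

int main(void) {
    work(3);
    work(5);
    return 0;
}
"""

pcn = build_pcn(TOY)
print(pcn)
```

Each key of the returned dictionary is a node of the PCN and each list entry is one directed edge, so repeated calls appear as repeated entries.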

FIG. 1: The diagram in the center represents the PCN of a toy kernel with
two procedures written in the C programming language. The graphs on the left
and right show the out- and in-degree probability distributions
$P_{\mathrm{out}}(\nu)$ and $P_{\mathrm{in}}(\nu)$. The colors correspond to
different kernel releases. The most recent version, 2.6.32, with
$N = 285509$ and an average of 3.18 calls per procedure, is represented in
red. Older versions (2.4.37.6, 2.2.26, 2.0.40, 1.2.12, 1.0), with $N$
respectively equal to 85756, 38766, 14079, 4358 and 2751, follow the same
behavior. The dashed curve shows the out-degree probability distribution if
only calls to distinct destination procedures are kept.

The out- and in-degrees of a node $i$ in this network are denoted
$\nu_{\mathrm{out}}(i)$ and $\nu_{\mathrm{in}}(i)$ respectively. Their values
for the toy code are also given in Fig. 1; they correspond to the number of
outgoing and incoming calls for each procedure. A network is called
scale-free when the distributions of the degrees $\nu_{\mathrm{out}}$ and
$\nu_{\mathrm{in}}$ are characterized by power-law tails. Many networks in
nature and in computer science fall in this class; this is the case, for
example, for the World Wide Web (WWW) and for the package dependencies in
Linux distributions [21]. The degree distributions for the PCN of several
releases of the Linux kernel are presented in Fig. 1. They show unambiguously
that the PCN is a scale-free network with properties similar to those of the
WWW. Indeed, the decay of the probability distribution
$P_{\mathrm{in}}(\nu)$ of incoming calls is well described by the power law
$P_{\mathrm{in}}(\nu) \propto \nu^{-\gamma_{\mathrm{in}}}$ with
$\gamma_{\mathrm{in}} = 2.0 \pm 0.02$. The probability distribution of
outgoing calls also follows a power law,
$P_{\mathrm{out}}(\nu) \propto \nu^{-\gamma_{\mathrm{out}}}$ with
$\gamma_{\mathrm{out}} = 3.0 \pm 0.1$. These values are close to the
exponents found for the WWW, where $\gamma_{\mathrm{in}} = 2.1$ and
$\gamma_{\mathrm{out}} \approx 2.7$. In the above distributions all procedure
calls were included; if only calls to distinct functions are counted in the
out-degree distribution, the exponent changes to
$\gamma_{\mathrm{out}} \approx 5$, whereas $\gamma_{\mathrm{in}}$ remains
unchanged. It should be stressed that the distributions for the different
kernel releases remain stable even though the network size increases from
$N = 2751$ for version 1.0 to $N = 285509$ for the latest version, 2.6.32.
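
As an illustration of how the degrees and their distributions can be obtained from a call list, here is a short Python sketch. The dictionary layout, the function names and the toy data are assumptions made for this example. The exponent estimator is the standard continuous maximum-likelihood formula, swapped in for illustration; the text does not state how the exponents quoted above were fitted.

```python
import math
from collections import Counter


def degrees(calls, distinct=False):
    """Out- and in-degree of each procedure in calls = {caller: [callee, ...]}.

    With distinct=True, repeated calls to the same callee are counted once,
    which corresponds to the dashed curve described in the caption of Fig. 1.
    """
    out_deg = {name: 0 for name in calls}
    in_deg = {name: 0 for name in calls}
    for caller, callees in calls.items():
        for callee in (set(callees) if distinct else callees):
            out_deg[caller] += 1
            in_deg[callee] = in_deg.get(callee, 0) + 1
    return out_deg, in_deg


def distribution(deg):
    """Empirical probability of each degree value."""
    counts = Counter(deg.values())
    return {nu: c / len(deg) for nu, c in sorted(counts.items())}


def tail_exponent(values, nu_min):
    """Maximum-likelihood gamma for P(nu) ~ nu**(-gamma), nu >= nu_min.

    Continuous approximation; it fails if every value equals nu_min.
    """
    tail = [v for v in values if v >= nu_min]
    return 1.0 + len(tail) / sum(math.log(v / nu_min) for v in tail)


# Hypothetical toy network: main calls work twice, work calls itself once.
toy = {"main": ["work", "work"], "work": ["work"]}
out_all, in_all = degrees(toy)
out_distinct, _ = degrees(toy, distinct=True)
print(distribution(out_all), distribution(out_distinct))
```

On real data the estimator is only meaningful for the tail of the distribution, so the choice of nu_min matters and should be checked against the log-log plot.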

This similarity between the PCN and WWW networks can be attributed to
important development constraints that exist for both networks. Indeed, the
WWW was designed as an information sharing system where users can easily
access and create entries. The same principle also applies to Open Source
development, where the project is advanced by a loosely knit programmer
community.

Due to this similarity it is natural to apply the methods developed to
organize information on the WWW to the PCN. PageRank is probably the most
successful known link analysis algorithm. It is based on the construction of
the Google matrix:

$$ G_{ij} = \alpha S_{ij} + (1 - \alpha)/N \qquad (1) $$

where the matrix $S$ is constructed by normalizing to unity all columns of
the adjacency matrix and replacing columns with only zero elements by $1/N$,
$N$ being the network size [12]. In the WWW context the damping parameter
$\alpha$ describes a random surfer who follows a link with probability
$\alpha$ and jumps to any node with probability $1 - \alpha$. For the PCN
this jump can describe the probability of modifying a global variable that
affects the overall code behavior. The value $\alpha \approx 0.85$ seems to
give a good classification for the WWW [12], thus I also used this value for
the PCN. The matrix $G$ belongs to the class of Perron-Frobenius operators.
Its largest eigenvalue is $\lambda = 1$ and the other eigenvalues have
$|\lambda| \le \alpha$. The right eigenvector at $\lambda = 1$ gives the
probability $p(i)$ to find a random surfer at site $i$; it is called the
PageRank vector. Once the PageRank is found, WWW sites are sorted by
decreasing $p(i)$; the rank $K(i)$ of a site in this index reflects its
relevance.
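
The construction of Eq. (1) can be sketched in a few lines of Python. This is a dense, pure-Python illustration under stated assumptions: nodes are labeled by integers, edges are (source, target) pairs, and the eigenvector is found by plain power iteration. The function names and the three-node example are hypothetical, and a dense matrix is only practical for small networks.

```python
def google_matrix(n, edges, alpha=0.85):
    """Dense Google matrix G[i][j] = alpha * S[i][j] + (1 - alpha) / n."""
    adj = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        adj[dst][src] += 1.0          # column = source node, row = target node
    g = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(adj[i][j] for i in range(n))
        for i in range(n):
            # Columns without links are replaced by the uniform value 1/n.
            s_ij = adj[i][j] / col_sum if col_sum else 1.0 / n
            g[i][j] = alpha * s_ij + (1.0 - alpha) / n
    return g


def pagerank(g, tol=1e-12, max_iter=1000):
    """Right eigenvector of g at eigenvalue 1, found by power iteration."""
    n = len(g)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(g[i][j] * p[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p


# Hypothetical network: 0 calls 1 and 2, 1 calls 2, and 2 calls nothing.
G = google_matrix(3, [(0, 1), (0, 2), (1, 2)])
p = pagerank(G)
ranking = sorted(range(3), key=lambda i: -p[i])   # node indices by decreasing p
print(p, ranking)
```

Because every column of G sums to one, the iteration conserves the total probability, and the gap between 1 and the other eigenvalues guarantees convergence.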

The PageRank $p$ for the Linux PCN is shown in Fig. 2 as a function of the
rank $K$. The decay of $p(K)$ is well described by a power law
$p(K) \propto 1/K^{\beta}$ with $\beta \approx 1$. This value is consistent
with the relation $\beta = 1/(\gamma_{\mathrm{in}} - 1)$, which would be
exact if the PageRank of a procedure were proportional to its in-degree
$\nu_{\mathrm{in}}$. It is known that for the WWW this proportionality is
qualitatively valid, although the PageRank classification introduces
significant mixing compared to a classification based only on the in-degree
distribution. The inset of Fig. 2 illustrates that this mixing also exists
for the PCN; hence the PageRank classification of procedures is expected to
be more informative and stable, as in the WWW. Fig. 2 also reports the three
procedures with the highest PageRank in the Linux kernel. These popular
procedures perform well-defined tasks which may be useful in any part of the
code: for example, printk() reports system messages, while memset() and
kfree() are involved in memory management.
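
The relation between the two exponents follows from a short rank-counting argument, sketched here under the idealized assumption that the PageRank is exactly proportional to the in-degree $\nu$ and that the in-degree distribution is a pure power law with exponent $\gamma_{\mathrm{in}} > 1$:

```latex
% Assumptions: p(i) \propto \nu(i) and P_{\mathrm{in}}(\nu) \propto \nu^{-\gamma_{\mathrm{in}}}.
% The rank K of a node with in-degree \nu is the number of nodes with a larger in-degree.
\begin{align}
K(\nu) &\simeq N \int_{\nu}^{\infty} P_{\mathrm{in}}(\nu')\, d\nu'
        \propto \nu^{-(\gamma_{\mathrm{in}} - 1)}, \\
\nu(K) &\propto K^{-1/(\gamma_{\mathrm{in}} - 1)}, \\
p(K)   &\propto \nu(K) \propto K^{-\beta},
\qquad \beta = \frac{1}{\gamma_{\mathrm{in}} - 1}.
\end{align}
```

With $\gamma_{\mathrm{in}} = 2.0$ this gives $\beta = 1$, in agreement with the observed decay.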

FIG. 2: PageRank $p$ and influence-PageRank $p^*$ as a function of the ranks
$K$ (for $p$) and $K^*$ (for $p^*$) for the PCN of the Linux kernel, release
2.6.32. The procedures with the highest $p$ and $p^*$ are given on the left.
The inset illustrates the correlation between $p$ and the in-degree
$\nu_{\mathrm{in}}$: procedures are ordered in a serpentine pattern from low
$p$ at the bottom to high $p$ at the top, while the color code follows the
value of the in-degree.



Although these procedures with high PageRank take care of highly useful
tasks, their role in the overall program structure is limited. This suggests
the existence of another, complementary classification reflecting the
influence of a procedure on the code organization. In the Hubs and
Authorities algorithm [14], proposed in the WWW context, the sites are
characterized by two ranks reflecting their "hubness" (influence) and
"authority" (popularity). However, this method is less stable than PageRank
and is generally used for small subnetworks [13, 19]. Hence I apply an
alternative approach which is still based on the PageRank algorithm. It
consists in inverting the direction of the links in the adjacency matrix
before the construction of the Google matrix. This transposed adjacency
matrix describes the flow of information returned from the called procedures
to their parents. I will call influence-PageRank $p^*(i)$ the PageRank
vector of this modified Google matrix; the procedures can now be sorted
according to their influence $p^*(i)$, yielding a new rank $K^*(i)$. The
dependence of $p^*$ on $K^*$ for the Linux kernel code is presented in
Fig. 2. Again the decay is well described by a power law
$p^*(K^*) \propto 1/(K^*)^{\beta^*}$, where
$\beta^* \approx 1/(\gamma_{\mathrm{out}} - 1) \approx 1/2$. In this
classification, the first procedures fulfill an important organizational
role: e.g. start_kernel() initializes the kernel and manages the
distribution of tasks.
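
The link inversion is simple to express on an edge list. The sketch below is a sparse, pure-Python variant with a fixed number of iterations; the function names, the procedure names and the toy call graph are hypothetical and chosen only to show that the two rankings single out different procedures.

```python
def pagerank(n, edges, alpha=0.85, iterations=200):
    """PageRank for an edge list of (source, target) node indices."""
    out = [0] * n
    for src, _ in edges:
        out[src] += 1
    p = [1.0 / n] * n
    for _ in range(iterations):
        # Nodes without outgoing links spread their weight uniformly.
        dangling = sum(p[j] for j in range(n) if out[j] == 0)
        new = [(1.0 - alpha) / n + alpha * dangling / n] * n
        for src, dst in edges:
            new[dst] += alpha * p[src] / out[src]
        p = new
    return p


def influence_pagerank(n, edges, alpha=0.85):
    """PageRank of the network with every call reversed (callee -> caller)."""
    return pagerank(n, [(dst, src) for src, dst in edges], alpha)


# Hypothetical call graph: main calls both initializers and log,
# and each initializer calls log.
names = ["main", "init_a", "init_b", "log"]
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
p = pagerank(len(names), edges)
p_star = influence_pagerank(len(names), edges)
most_popular = names[p.index(max(p))]
most_influential = names[p_star.index(max(p_star))]
print(most_popular, most_influential)
```

In this toy graph the widely called utility comes first in the PageRank ordering, while the entry point that dispatches the work comes first in the influence ordering.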

FIG. 3: The left panel b) represents the joint probability distribution
$P(p, p^*)$ as a function of $\log p$ and $\log p^*$ for the PCN of the Linux
kernel, release 2.6.32. Regions with low probability are colored in
black/red, while regions with high probability are colored in blue/green.
Panel c) shows the product probability $p(p)\,p^*(p^*)$ on the same scale; it
reproduces $P(p, p^*)$ with high fidelity. Panel a) shows the value of the
correlator $\kappa$ as a function of the PCN size $N$ for the Linux kernel
releases from Fig. 1. Panels d) and e) compare the joint probability
distribution $P(p, p^*)$ with the product probability $p(p)\,p^*(p^*)$ for
the Cambridge University WWW network. The correlated structure along the
diagonal $p = p^*$ which is present in panel d) is not reproduced in
panel e).

FIG. 4: Distribution of eigenvalues $\lambda$ in the complex plane for the
Google matrix of the Linux kernel 2.0.40 with $N = 14079$. Circles highlight
the ring region $0.1 < |\lambda| < 1$.

The correlation between popular and influential procedures in the PCN is
described by the joint probability distribution $P(p, p^*)$, which gives the
probability of finding a procedure $i$ with $(p(i), p^*(i))$ in a small area
around $(p, p^*)$. This distribution is displayed in Fig. 3, where it is
compared with the distribution obtained under the assumption that $p$ and
$p^*$ are independent quantities. The latter is the product of the
probabilities $p(p)$ and $p^*(p^*)$ to find a procedure in an interval around
$p$ and $p^*$ respectively, so that $P = p(p)\,p^*(p^*)$. These two
distributions are very similar, showing that popularity and influence are
only weakly correlated in the PCN. The direct computation of the correlator
$\kappa$:

$$ \kappa = N \sum_i p(i)\, p^*(i) - 1 \qquad (2) $$

supports this assumption of independence. Indeed, it was found that
$|\kappa| \ll 1$ for the PCN of the Linux kernel for all releases. For most
releases this correlator is negative, indicating a certain anti-correlation
between popular and influential procedures. These observations also hold for
other Open Source software, including Gimp 2.6.8 ($\kappa = -0.068$,
$N = 17540$) and X Windows server R7.1-1.1.0 ($\kappa = -0.027$,
$N = 14887$).
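
The correlator of Eq. (2) is a one-line computation once the two normalized vectors are available. The Python sketch below uses hypothetical four-node vectors to show the three regimes discussed here: no correlation, positive correlation and anti-correlation. The function name and the numbers are assumptions made for this example.

```python
def correlator(p, p_star):
    """kappa = N * sum_i p(i) * p_star(i) - 1 for two normalized vectors."""
    n = len(p)
    return n * sum(a * b for a, b in zip(p, p_star)) - 1.0


# Hypothetical vectors on four nodes, each summing to one.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
mirrored = [0.1, 0.1, 0.1, 0.7]

k_none = correlator(uniform, skewed)       # one vector is flat: kappa = 0
k_positive = correlator(skewed, skewed)    # same node dominates both: kappa > 0
k_negative = correlator(skewed, mirrored)  # different nodes dominate: kappa < 0
print(k_none, k_positive, k_negative)
```

The subtraction of one removes the value that the sum takes when the two vectors are statistically independent, so that the sign of the result directly indicates correlation or anti-correlation.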

This absence of correlations between popularity and influence in the PCN
contrasts with the WWW hyperlink network. In the latter case the correlator
is positive and of order unity: this was confirmed by analyzing hyperlinks
for several UK universities available at [22]. For example, I find for the
web sites of the universities of Cambridge ($\kappa = 3.79$, $N = 376836$),
Oxford ($\kappa = 1.52$, $N = 331955$), Bath ($\kappa = 7.22$, $N = 112143$)
and Hull ($\kappa = 2.09$, $N = 21061$). Note that the typical value of
$\kappa$ does not directly depend on the network size. The joint probability
$P(p, p^*)$ and the product probability $p(p)\,p^*(p^*)$ for the Cambridge
University network are compared in Fig. 3. The product probability
reproduces to some extent the behavior of $P(p, p^*)$ but fails to capture
the correlations along the diagonal $p = p^*$, as expected from the positive
value of the correlator $\kappa = 3.79$.

The above observations suggest that the independence between popular
procedures, which fulfill important but well-defined tasks, and influential
procedures, which organize and assign tasks in the code, is an important
ingredient of well structured software. The heuristic content of this
independence criterion is related to the well-known concept of "separation of
concerns" [4] in software architecture. The correlation coefficient $\kappa$
makes it possible to express this concept in a quantitative way. Procedures
that have high values of both $p(i)$ and $p^*(i)$ can therefore play a
critical role, since they are popular and influential at the same time. For
example, in the Linux kernel, do_fork(), which creates new processes, belongs
to this class. These critical procedures may introduce subtle errors because
they entangle otherwise independent segments of code.
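
One possible way to flag such critical procedures is to intersect the leading entries of the two rankings. The Python sketch below does this for hypothetical scores; the cutoff `top`, the function name, the procedure names and the numbers are all assumptions made for illustration, and the text does not prescribe a particular threshold.

```python
def critical_procedures(names, p, p_star, top=10):
    """Names ranked within the first `top` positions of both K and K*."""
    def leaders(score):
        order = sorted(range(len(names)), key=lambda i: score[i], reverse=True)
        return set(order[:top])
    both = leaders(p) & leaders(p_star)
    return [names[i] for i in sorted(both)]


# Hypothetical scores: one procedure is high in both classifications.
names = ["log", "main", "fork_like", "helper"]
p = [0.50, 0.05, 0.35, 0.10]        # popularity (PageRank)
p_star = [0.05, 0.50, 0.35, 0.10]   # influence (influence-PageRank)

critical = critical_procedures(names, p, p_star, top=2)
print(critical)
```

A rank-based cutoff is used here rather than a threshold on the values themselves because both vectors decay as power laws of the rank, so absolute values depend strongly on the network size.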

The eigenvalues of the matrix $G$ provide information on the relaxation
rates to the PageRank. Eigenvalues with $|\lambda|$ close to unity represent
independent components weakly connected with the rest of the network. The
WWW has a significant number of such modes, showing the existence of many
independent communities. A typical eigenvalue distribution in the complex
plane for the Linux PCN is shown in Fig. 4. The proportion of modes with
$|\lambda| > 0.1$ is very small (around 1% for network size $N = 14079$)
compared to the case of university networks [15], where this percentage is
around 50% (for example for Liverpool John Moores University with
$N = 13578$). This result can be interpreted as follows: the web contains
many quasi-independent communities, whereas the PCN must ensure a strong
coordination between the different procedures, which must therefore be able
to exchange information.

The presented studies demonstrate close similarities between software
architecture and scale-free networks, especially the World Wide Web.
However, they also show that these networks have substantial differences:
the absence of correlation between popularity and influence in procedure
call networks, and a large number of vanishing eigenvalues in the Google
matrix, which indicates the small number of independent communities in
computer codes. The properties of software networks found here may lay the
foundation for a quantitative description of functional software
architectures. The proposed methods can be generalized to object-oriented
programming and may find several applications in software development.
Possible applications include guidance for the design of code documentation
and improvements in code refactoring techniques. Finally, the identification
of critical procedures may facilitate the correction of subtle errors that
arise due to unintended entanglement in the code.

I thank T.C. Phan for fruitful discussions on software 
development and acknowledge DGA for support. 